Skip to content

BUG: fix categorical comparison with missing values (#26504 ) #26514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
Jun 1, 2019

Conversation

another-green
Copy link
Contributor

@another-green another-green commented May 24, 2019

This PR fixes issues comparison of ordered categorical with missing values evaluates to True(#26504 ). Now making comparison with missing values always gives False. The missing values could be introduced when constructing a Categorical with values that don't fall in with the categories provided.

@pep8speaks
Copy link

pep8speaks commented May 24, 2019

Hello @yanglinlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-01 00:32:57 UTC

@codecov
Copy link

codecov bot commented May 24, 2019

Codecov Report

Merging #26514 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26514      +/-   ##
==========================================
- Coverage   91.76%   91.75%   -0.01%     
==========================================
  Files         174      174              
  Lines       50679    50684       +5     
==========================================
  Hits        46504    46504              
- Misses       4175     4180       +5
Flag Coverage Δ
#multiple 90.26% <100%> (ø) ⬆️
#single 41.71% <100%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.93% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b730ab3...cf156cf. Read the comment docs.

@codecov
Copy link

codecov bot commented May 24, 2019

Codecov Report

Merging #26514 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26514      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         174      174              
  Lines       50644    50647       +3     
==========================================
- Hits        46515    46514       -1     
- Misses       4129     4133       +4
Flag Coverage Δ
#multiple 90.37% <100%> (ø) ⬆️
#single 41.69% <57.14%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.92% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6a7cc1...d83c4f4. Read the comment docs.

@gfyoung gfyoung added Categorical Categorical Data Type Bug labels May 24, 2019
@gfyoung
Copy link
Member

gfyoung commented May 24, 2019

@yanglinlee : Thanks for submitting this! Can you also add a whatsnew entry to v0.25.0.rst?

@another-green
Copy link
Contributor Author

@yanglinlee : Thanks for submitting this! Can you also add a whatsnew entry to v0.25.0.rst?

Sure! Done~

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to be clear this has nothing to do with the None value right? Isn't this generally applicable to values that don't fall into a category and if so can you update whatsnew / commentary and tests to reflect that?

@WillAyd WillAyd added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label May 25, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzled of the purpose of this PR. We don't' allow any comparison of categorical except for eq / ne for non-ordered. I would be ok with simply raising the same TypeError message that we already do. But what is the purpose of this expansion?

@another-green
Copy link
Contributor Author

I am puzzled of the purpose of this PR. We don't' allow any comparison of categorical except for eq / ne for non-ordered. I would be ok with simply raising the same TypeError message that we already do. But what is the purpose of this expansion?

Just want to get some clarification:
Scenario 1: Missing values.

In [50]:
cat = pd.Categorical([1, 2, 3, None], categories=[1, 2, 3], ordered=True) 
cat
Out[50]:
[1, 2, 3, NaN]
Categories (3, int64): [1 < 2 < 3]
In [51]: 
cat <=2
Out[51]:
array([ True,  True, False,  True])

If we allow missing values in the array, then the result should not be raising errors.
People suggest the result in Out[51] should be array([ True, True, False, False])

Scenario 2: Values that don't fall into a category as WillAyd suggested.

In [54]:
cat2 = pd.Categorical([1, 2, 3, 999], categories=[1, 2, 3], ordered=True) 
cat2
Out[54]:
[1, 2, 3, NaN]
Categories (3, int64): [1 < 2 < 3]
In [55]:
cat2<=2
Out[55]:
array([ True,  True, False,  True])

Currently it doesn't raise TypeError.

If you want to raise TypeError for both scenarios, I can change the behavior of the comparison with scalar. Let me know your suggestions. Thanks.

@jreback
Copy link
Contributor

jreback commented May 26, 2019

if the above scenarios compare with a listlike they raise i believe
a scalar should not be different

@another-green
Copy link
Contributor Author

if the above scenarios compare with a listlike they raise i believe
a scalar should not be different

I tested comparison with listlike, and it didn't raise TypeError. It only takes comparison with None as False.

cat1 = pd.Categorical([1,2,3,999], categories=[1,2,3], ordered=True)
cat2 = pd.Categorical([2,2,2,2], categories=[1,2,3], ordered=True)
cat1
Out[]:
[1, 2, 3, NaN]
Categories (3, int64): [1 < 2 < 3]
In[]:
cat1 >= cat2
Out[]:
array([False,  True,  True, False])

In this case, do we need to change the behavior of comparison to listlike? So in both cases, the comparison to None will raise Error.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 28, 2019

I am puzzled of the purpose of this PR.

The purpose is quite clear to me: fixing a bug. Which is clearly described in the issue: #26504

The bug is that when doing a comparison (which is allowed for ordered categoricals), missing values always evaluate to True or False (depending on which operations, because they are encoded as -1 and we use that value in the comparison).
While comparisons to NaN should always be False.

To show this in another way:

In [49]: cat = pd.Categorical([1, 2, 3, None], categories=[1, 2, 3], ordered=True)

In [50]: pd.Series(cat)           
Out[50]: 
0      1
1      2
2      3
3    NaN
dtype: category
Categories (3, int64): [1 < 2 < 3]

In [51]: pd.Series(cat) < 2    
Out[51]: 
0     True
1    False
2    False
3     True
dtype: bool

In [52]: pd.Series(cat, dtype=object) < 2                     
Out[52]: 
0     True
1    False
2    False
3    False
dtype: bool

The object dtype has the correct behaviour: a missing value gives False in the result.

So to be clear this has nothing to do with the None value right?

This is not about "None", or also not about values that don't fall within the category, but is is about missing values (but both None as non-existent categories when constructing the categorical lead to missing values in the actual categorical)

@jorisvandenbossche jorisvandenbossche changed the title BUG-#26504 Fix : None comparison evaluates to True BUG: fix categorical comparison with missing values (#26504 ) May 28, 2019
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looking good to me, thanks for the PR!

Can you address the comment of @gfyoung about the whatsnew message?

@jreback
Copy link
Contributor

jreback commented May 28, 2019

this needs to raise for non ordered categoricals

that is not addressed in this PR as there are no tests for this

furthermore the behavior for listlikes must also be the same (raise for non ordered)
if u r making work for scalars then it needs to work for listlikes as well (for ordered)

@TomAugspurger
Copy link
Contributor

this needs to raise for non ordered categoricals

that is not addressed in this PR as there are no tests for this

I think the issue was about ordered Categoricals. unordered does indeed raise

In [3]: import pandas as pd
   ...: pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"]) <= "2"
   ...:
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-1804d8ae703c> in <module>
      1 import pandas as pd
----> 2 pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"]) <= "2"
      3

~/sandbox/pandas/pandas/core/arrays/categorical.py in f(self, other)
     64         if not self.ordered:
     65             if op in ['__lt__', '__gt__', '__le__', '__ge__']:
---> 66                 raise TypeError("Unordered Categoricals can only compare "
     67                                 "equality or not")
     68         if isinstance(other, Categorical):

TypeError: Unordered Categoricals can only compare equality or not

@jorisvandenbossche
Copy link
Member

this needs to raise for non ordered categoricals

This does already raise, that is not the issue.

furthermore the behavior for listlikes must also be the same. if u r making work for scalars then it needs to work for listlikes as well (for ordered)

This does already work for listlikes. Actually this PR basically copied the code from the code branch for list-likes to the code branch of scalars.

@WillAyd
Copy link
Member

WillAyd commented May 28, 2019

This is not about "None", or also not about values that don't fall within the category, but is is about missing values (but both None as non-existent categories when constructing the categorical lead to missing values in the actual categorical)

Gotcha - so in both in the original issue and here in the whatsnew we keep referring to None as being of importance but that is distracting from the actual issue.

@yanglinlee looks like you might have some other changes to the whatsnew but can you account for that on next edit? Ping back if you are stuck on wording

@jorisvandenbossche
Copy link
Member

Gotcha - so in both in the original issue and here in the whatsnew we keep referring to None as being of importance but that is distracting from the actual issue.

Yes, it was by using "None" instead of "missing values" that caused some confusion I think. I already changed the title of this PR.

@another-green
Copy link
Contributor Author

Thank everyone for the helpful discussions! This is great suggestion! I will update the PR after work tonight.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanglinlee Thanks for the updates. Added a few more comments.

@jreback jreback added this to the 0.25.0 milestone May 30, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm outside of @jreback comments

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what happened with the last CI run. Try merging master and pushing again.

@TomAugspurger
Copy link
Contributor

@yanglinlee CI is being fixed in #26586. Merge master & repush once that's in.

@jreback jreback merged commit 9ebbe1b into pandas-dev:master Jun 1, 2019
@jreback
Copy link
Contributor

jreback commented Jun 1, 2019

thanks @yanglinlee

vaibhavhrt pushed a commit to vaibhavhrt/pandas that referenced this pull request Jun 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ordered categorical comparison with missing values evaluates to True